Assignment 5: Dimensionality reduction

Brief from moodle

Actually, a fairly large selection of statistical methods can be listed under the title “dimensionality reduction techniques”. Most often (nearly always, that is!) the real-world phenomena are multidimensional: they may consist of not just two or three but 5 or 10 or 20 or 50 (or more) dimensions. Of course, we are living only in a three-dimensional (3D) world, so those multiple dimensions may really challenge our imagination. It would be easier to reduce the number of dimensions in one way or another.

We shall now learn the basics of two data science based ways of reducing the dimensions. The principal method here is principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that collect together as much variance as possible from the original variables. The most important components can be then used for various purposes, e.g., drawing scatterplots and other fancy graphs that would be quite impossible to achieve with the original variables and too many dimensions.

Multiple correspondence analysis (MCA) and other variations of CA bring us similar possibilities in the world of discrete variables, even nominal scale (classified) variables, by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously with the PCA. The typical graphs show the original classes of the discrete variables on the same “map”, making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!).

Briefly stated, these methods help to visualize and understand multidimensional phenomena by reducing their dimensionality that may first feel impossible to handle at all.

0. Setup, read the data

library(readr)
library(tibble)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(FactoMineR)
human0 <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)

1. Country names as rownames + summaries

  • Move the country names to rownames (see Exercise 5.5).
  • Show a graphical overview of the data and show summaries of the variables in the data.
  • Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Rownames

human <- column_to_rownames(human0, "Country")

Summary, statistics

Let’s look at the data, starting with some descriptive statistics:

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Based on the differences between mean and median, and the ranges of the variables, Edu.Exp and Parli.F seem the most normally distributed. Edu2.FM, Labo.FM and Life.Exp are more skewed or have longer tails. GNI and Mat.Mor look closer to log-normal (or a related distribution), and perhaps Ado.Birth as well.
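The mean-vs-median heuristic used above can be illustrated with a small self-contained sketch on synthetic data (not the human data itself): for a right-skewed, log-normal sample the mean sits well above the median, while for a symmetric sample they nearly coincide.

```r
# Synthetic check of the mean vs. median heuristic:
# a right-skewed (log-normal) sample has mean > median,
# a symmetric (normal) sample has mean ~ median.
set.seed(42)
x_norm  <- rnorm(1000)   # symmetric, like Edu.Exp
x_lnorm <- rlnorm(1000)  # right-skewed, like GNI
round(c(normal    = mean(x_norm)  - median(x_norm),
        lognormal = mean(x_lnorm) - median(x_lnorm)), 2)
```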

Summary, graphical

corrplot(cor(human), method = "circle", type = "upper", tl.pos = "d")
fig. 1.1, correlation matrix of variables

There are definitely strong correlations in the data. We can identify three distinct groups of variables in this matrix.

First, there is one set of very strongly correlated variables: Mat.Mor, Ado.Birth, Life.Exp, and Edu.Exp. These will likely be very strongly represented in the first principal component of a PCA.

Second, GNI and Edu2.FM also correlate strongly with these four variables, more strongly, in fact, than they correlate with each other. Since they correlate less amongst themselves, there seem to be more degrees of freedom here.

Then there is a third group, Labo.FM and Parli.F. They are much less correlated with the other variables; of all the variables, they correlate most strongly with each other. Again, there seem to be more degrees of freedom in this group.

ggpairs(human, progress = FALSE)
fig. 1.2, distributions and scatterplots of variables

The correlations become much clearer when looking at the scatterplots. Some relationships are clearly linear, like Edu.Exp vs. Life.Exp. It’s also even clearer that GNI might need to be log-transformed, based on its distribution and the shapes of its scatterplots, and possibly Mat.Mor and Ado.Birth too.

The other distributions look like skewed but roughly normal distributions: each has one main mode, although there are some interesting concentrations in certain parts of the distributions.

E.g., in Life.Exp, there is a clear plateau from 60 to roughly 65 or so. This may be related to quality of life before and after retirement.

Another example is the curious smaller mode in Edu2.FM around 0.5, and the subsequent drop after it. This suggests that there are two overlaid populations: one where the mode is a little under 1, another where the mode is around 0.5. I don’t have a hypothesis for what could be causing this difference between the two populations of countries, but it’s a very interesting feature of the distribution. Unfortunately, we’ve already thrown away the two variables that this ratio is based on, which might have given us some clues about the effect.

Let’s do some transforms, just for fun

human.log <- human %>% mutate(GNI = log(GNI), Mat.Mor = log(Mat.Mor), Ado.Birth = log(Ado.Birth))
ggpairs(human.log, progress = FALSE)
fig. 1.3, distributions and scatterplots of variables after log transform

The scatterplots now form much clearer trend lines, which suggests this was a good idea. Let’s recheck the correlation matrix, just to verify that rescaling the data didn’t change the relationships too much.

corrplot(cor(human.log), method = "circle", type = "upper", tl.pos = "d")
fig. 1.4, correlation matrix using log transformed dataset

Well, this does seem to have changed a lot of things. The correlations are now much more even among five variables: Life.Exp, Edu.Exp, GNI, Mat.Mor, and Ado.Birth. The correlation with Edu2.FM is a little weaker, and the Parli.F and Labo.FM variables are again least correlated with the others.

2. PCA with raw data

  • Perform principal component analysis (PCA) on the raw (non-standardized) human data.
  • Show the variability captured by the principal components.
  • Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Since the log transform was not part of the assignment, let’s forget it for now, and do PCA on the raw dataset and look at the coefficients.

pca_raw <- prcomp(human)
pca_raw
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03

PC1 is nearly all GNI, PC2 is nearly all Mat.Mor, and PC3 is nearly all Ado.Birth, etc. This is a problem. Let’s see why by looking at the summary of the transformed rows, which are in pca_raw$x.

summary(pca_raw$x)
##       PC1               PC2               PC3               PC4         
##  Min.   :-105495   Min.   :-858.85   Min.   :-87.306   Min.   :-37.798  
##  1st Qu.:  -6885   1st Qu.: -75.30   1st Qu.:-11.498   1st Qu.: -6.643  
##  Median :   5587   Median :  60.76   Median :  2.681   Median :  1.787  
##  Mean   :      0   Mean   :   0.00   Mean   :  0.000   Mean   :  0.000  
##  3rd Qu.:  13430   3rd Qu.: 117.75   3rd Qu.: 13.592   3rd Qu.:  8.200  
##  Max.   :  17051   Max.   : 200.85   Max.   : 99.552   Max.   : 26.267  
##       PC5                PC6                PC7                PC8          
##  Min.   :-16.0452   Min.   :-4.38647   Min.   :-0.70772   Min.   :-0.43774  
##  1st Qu.: -2.0796   1st Qu.:-1.10824   1st Qu.:-0.08793   1st Qu.:-0.08919  
##  Median :  0.5794   Median : 0.07198   Median : 0.02775   Median :-0.01657  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:  2.3562   3rd Qu.: 1.09716   3rd Qu.: 0.10801   3rd Qu.: 0.08364  
##  Max.   :  7.6214   Max.   : 3.80771   Max.   : 0.80521   Max.   : 0.50052

As we saw earlier, GNI per capita ranges from roughly 500 to over 120,000, while the other variables top out in the hundreds at most. Since PCA maximizes the spread captured along each principal component, the largest-scale variables dominate the result: the distance scales of the variables are not comparable, so the dataset needs to be standardized.
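This scale-domination effect can be reproduced with a tiny toy example (synthetic data, not the human data): give one of two independent variables a thousand-fold larger standard deviation and unscaled PCA devotes PC1 almost entirely to it.

```r
# Toy example: two independent variables on wildly different scales.
set.seed(1)
toy <- data.frame(big   = rnorm(100, sd = 1000),  # plays the role of GNI
                  small = rnorm(100, sd = 1))     # plays the other variables
p_toy <- prcomp(toy)                              # no standardization
# PC1's loading on the large-scale variable is essentially 1:
abs(p_toy$rotation["big", "PC1"])
```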

If we look at a summary of the pca, this is even more clear:

summary(pca_raw)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

This tells us that PC1 captures nearly all of the variance (>0.9999) in the dataset, which again is due to the dataset not being standardized.

Knowing this, we can’t expect a very good biplot, but let’s plot one anyway.

pca_pr <- round(100*summary(pca_raw)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_raw, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("Negative GNI,", labels[1]), ylab = paste("Maternal mortality,",labels[2]))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

fig. 2.1, biplot for PCA of the raw data

Again, we see that PC1 is pretty much only GNI: the GNI arrow is the only visible one. This can also be inferred from the ordering of the countries, with high-GNI countries on the left and low-GNI countries on the right.

I’ve named the principal components according to the single variable each has picked up from the dataset (negative GNI and maternal mortality), since each of the first few components is almost aligned with one original dimension.

3. PCA with standardized variables

  • Standardize the variables in the human data and repeat the above analysis.
  • Interpret the results of both analysis (with and without standardizing).
  • Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

Let’s standardize, do a PCA, and look at the principal components.

pca_scaled <- human %>% scale() %>% as.data.frame() %>% prcomp()
pca_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110  0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424 -0.2625067
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819 -0.1628935
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294 -0.1659678
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675  0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614  0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763  0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293  0.1523699
##                   PC6         PC7         PC8
## Edu2.FM   -0.17713316  0.05773644  0.16459453
## Labo.FM    0.03500707 -0.22729927 -0.07304568
## Life.Exp   0.42242796 -0.43406432  0.62737008
## Edu.Exp    0.38606919  0.77962966 -0.05415984
## GNI       -0.11120305 -0.13711838 -0.16961173
## Mat.Mor   -0.17370039  0.35380306  0.72193946
## Ado.Birth  0.76056557 -0.06897064 -0.14335186
## Parli.F   -0.13749772  0.00568387 -0.02306476

This already looks much better. Let’s see the spreads.

summary(pca_scaled$x)
##       PC1               PC2                PC3                PC4          
##  Min.   :-3.4128   Min.   :-2.79379   Min.   :-2.13797   Min.   :-3.43435  
##  1st Qu.:-1.4651   1st Qu.:-0.61041   1st Qu.:-0.61625   1st Qu.:-0.51644  
##  Median :-0.4934   Median : 0.04166   Median :-0.07731   Median : 0.05376  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.0237   3rd Qu.: 0.68895   3rd Qu.: 0.47541   3rd Qu.: 0.54936  
##  Max.   : 6.0717   Max.   : 3.12552   Max.   : 2.56743   Max.   : 2.27574  
##       PC5                PC6                PC7                PC8         
##  Min.   :-1.54550   Min.   :-1.82764   Min.   :-0.98931   Min.   :-1.0576  
##  1st Qu.:-0.41848   1st Qu.:-0.26156   1st Qu.:-0.29823   1st Qu.:-0.1581  
##  Median :-0.06288   Median :-0.02236   Median :-0.02144   Median : 0.0261  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.33133   3rd Qu.: 0.34286   3rd Qu.: 0.29026   3rd Qu.: 0.1657  
##  Max.   : 2.74409   Max.   : 1.49002   Max.   : 1.13872   Max.   : 1.3844

OK, the ranges of the dimensions seem much more sensible now; they are all of the same order of magnitude. Let’s see how much variance each principal component explains.

summary(pca_scaled)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

PC1 explains more than half the variance, not bad! However, every principal component still captures at least one percent of the variance, so we will lose some information if we drop any of them.
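The “Proportion of Variance” row above is just the squared component standard deviations normalized to sum to one. A self-contained check of that arithmetic, using the built-in USArrests data as a stand-in (the computation is identical for pca_scaled):

```r
# "Proportion of Variance" is sdev^2 / sum(sdev^2); USArrests is a built-in
# dataset used here only to make the sketch self-contained.
p <- prcomp(scale(USArrests))
prop_var <- p$sdev^2 / sum(p$sdev^2)
round(rbind("Proportion of Variance" = prop_var,
            "Cumulative Proportion"  = cumsum(prop_var)), 4)
```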

pca_pr <- round(100*summary(pca_scaled)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")

biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("neg. Human development,", labels[1]), ylab = paste("Gender equality,",labels[2]))
fig. 3.1, biplot for PCA of the scaled data

Much better! And better yet, all the original variables are roughly axis-aligned. I’ve added descriptive axis labels based on which variables align with which axes; more on these in section 4.

3.1 Interpretation

Raw data vs scaled data

As already discussed, the raw data was a bad fit for PCA due to the different orders of magnitude in the dispersion among the variables. PCA on raw data essentially just picked out one variable at a time in decreasing order of scale.

PCA on the scaled data performs much better.

PC1 is neg. Human development

I’ve decided to name PC1 neg. Human development, inspired by the name of the original dataset. This principal component measures the welfare of the country in terms of health (Mat.Mor, Ado.Birth, Life.Exp), standard of living (GNI per cap., ppp adjusted), and education (Edu.Exp, Edu2.FM). The value of this component is smaller with better outcomes in these domains, which is where the neg. comes in. I would rather take the negative of this PC1 and call it Human development, but this is what the PCA gave us.
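Taking the negative of PC1 would in fact be legitimate: a PCA solution is only identified up to the sign of each component, as long as the scores and loadings are flipped together. A minimal sketch (flip_pc is a hypothetical helper, demonstrated on the built-in USArrests data so it runs standalone):

```r
# Flip the sign of one principal component; the reconstruction
# x %*% t(rotation) is unchanged because the two sign flips cancel.
flip_pc <- function(pca, pc) {
  pca$rotation[, pc] <- -pca$rotation[, pc]
  pca$x[, pc]        <- -pca$x[, pc]
  pca
}
pca1 <- prcomp(scale(USArrests))
pca2 <- flip_pc(pca1, "PC1")
all.equal(pca1$x %*% t(pca1$rotation), pca2$x %*% t(pca2$rotation))
```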

PC2 is gender equality

PC2 I’ve called gender equality, because it measures female participation in political decision-making (0.65 * Parli.F) and female participation in the labour market (0.72 * Labo.FM).

This is perhaps not the whole story, because this component also has a positive contribution from maternal mortality and adolescent birth rates. Maybe this component only measures whether society has moved away from traditional gender roles. In that case, it can be seen as a cultural liberal/conservative axis.

PC3 is negative female political empowerment, or negative attitudes towards female leadership

PC3 is positively correlated with female participation in political decision-making (0.73 * Parli.F), but negatively correlated with the ratio of female participation in labor markets and secondary education (-0.584 * Labo.FM, -0.24 * Edu2.FM).

Let’s flip this around and consider NPC3 = -PC3; it’s easier to reason about that way. If a country has a high share of female MPs relative to female education and labor-market participation, then NPC3 is high.

So this principal component measures how women are viewed in society. Are they seen as leaders (and elected into parliament)? Then PC3 is low. Are they seen as useful in the workforce but not as leaders? Then PC3 is high.

PC4 is difficult to interpret

The principal components are getting harder and harder to reason about as we go further down the list.

Roughly, PC4 = 0.62 Edu2.FM - 0.72 GNI - 0.25 Mat.Mor.

These variables don’t make an obvious story. Edu2.FM should be very close to 1 for all developed nations, as dropping out of secondary education is rare there, and given the negative loading on GNI per capita, maybe this axis is about the economic development of the country? The maternal mortality term is difficult to square with this.

This component might measure the relative focus on wealth as compared to other societal welfare in the country. With heavy emphasis on might.

4. Interpret biplot

  • Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

The principal components

I already touched on this in the previous section, but as the first principal component aligns with Mat.Mor, Ado.Birth, Life.Exp, GNI, Edu.Exp, and Edu2.FM, it seems to capture the general wellbeing of society. Or rather, the lack thereof, since it points in the other direction.

The second axis measures gender equality, as it aligns well with Parli.F and Labo.FM.

Social equality trend line

It is worth pointing out that there is a clear trend line in the biplot, running from the bottom right to the top left. Most of the countries lie along this trend line, with the rest spread widely around it, which is why the trend is not aligned with the principal components. Let’s call it the social equality trend line, as the countries at the top of it tend to have social welfare programs, such as universal healthcare, strong unions, and low Gini indices. (A bit hand-wavy, but this is a course assignment, not a research paper.)

Correlations

There are some additional interesting correlations between the original variables and the PCs. Mat.Mor and Edu.Exp both point slightly in the positive PC2 direction. This doesn’t make much intuitive sense if we take PC2 to capture gender equality or liberal values. However, Mat.Mor points in the opposite direction from the social equality trend line, which may suggest that the principal components aren’t aligned in any particularly meaningful direction. PCA has maximized the dispersion along PC1 and PC2, but that does not mean the axes are meaningful in themselves; they may be slightly misaligned compared to the labels I’ve given them.

This idea is supported by the fact that the PC1 scores are bimodal, so the second, smaller mode might be pulling the PCA slightly off the trend line. That mode consists mostly of African countries and other developing nations. The fact that the other variables are distributed differently there might be due to artifacts of colonialism or other big systemic differences between developed and developing nations.

ggplot(data.frame(pca_scaled$x), aes(x=PC1)) + geom_density()
fig. 3.2, PC1 distribution, clearly bimodal

Parli.F and Labo.FM point in slightly different directions, with Parli.F slightly aligned with GNI and Labo.FM roughly orthogonal to it. This suggests that female labor participation has no clear correlation with GNI (it doesn’t matter for the economy what the sex of a worker is), whereas female political leadership correlates positively with GNI (of course, we don’t have a causal link here, only a correlation).

EXTRA: redoing the PCA with the log-transformed data

Just for fun, let’s use the log-transformed data to see if any of those correlations change.

pca_log_scaled <- human.log %>% scale() %>% as.data.frame() %>% prcomp()
summary(pca_log_scaled)
biplot(pca_log_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"))
fig. 4.1, biplot for PCA of the standardized and log transformed data

Sure enough, if we look at the summary, we see that the first PC now captures 57% of the variance compared to 53.6% before, a better result. The first four PCs now capture 90% of the variance, after that we get diminishing returns.

pca_log_scaled
## Standard deviations (1, .., p=8):
## [1] 2.1505310 1.1259172 0.8681023 0.7727862 0.5558253 0.4410956 0.3827107
## [8] 0.3267292
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4         PC5
## Edu2.FM   -0.32850815  0.01067147 -0.35894548  0.77560185 -0.36060516
## Labo.FM    0.03263713  0.72942195 -0.61074433 -0.23458714  0.05083415
## Life.Exp  -0.42542035 -0.05622607  0.11141581 -0.03806539  0.26227538
## Edu.Exp   -0.42035073  0.09697456 -0.06818573 -0.03839701  0.44933876
## GNI       -0.43314660 -0.09480875 -0.01735880  0.06836804  0.13860792
## Mat.Mor    0.43672168  0.01162213 -0.02391875  0.21779671 -0.12263401
## Ado.Birth  0.38284906  0.03041796 -0.03340255  0.48515646  0.74256569
## Parli.F   -0.09178670  0.66724454  0.69216873  0.23021937 -0.10502880
##                   PC6         PC7         PC8
## Edu2.FM    0.05763161  0.12121944  0.11626173
## Labo.FM    0.12840535 -0.11338286 -0.08313215
## Life.Exp   0.73375796  0.22248569 -0.38118853
## Edu.Exp   -0.56661984  0.53272731 -0.03187199
## GNI       -0.27117249 -0.74980849 -0.37876162
## Mat.Mor   -0.17790440  0.22051498 -0.81597528
## Ado.Birth  0.11877107 -0.16353812  0.15412247
## Parli.F   -0.03795809 -0.03959254  0.01489879

Looking closer at the coefficients, PC1, PC2, and PC3 have roughly the same interpretation as before. PC4 has changed now, but it is still as difficult to interpret as before.

5. MCA on tea data

The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and about their perception of the product (12 questions). In addition, some personal details were asked (4 questions).

Load the tea dataset and convert its character variables to factors:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

  • Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
  • Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!).
  • Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA.
  • Comment on the output of the plots. (0-4 points)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300  36
#View(tea)

The dataset consists of 300 rows and 36 columns: each row holds one respondent’s answers to the questionnaire.

Summary statistics follow.

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

The FactoMineR documentation (https://rdrr.io/cran/FactoMineR/man/tea.html) says the first 18 variables are about how respondents drink tea, variable 19 is age, and the rest are personal questions and “product’s perception”, which I’m choosing to interpret as how they think about tea.

5.1 MCA

There are a lot of variables here, so let’s choose a sensible subset of them to look at. I’m interpreting the following variables free-hand, since I can’t find much metadata:

  • Tea: what kind of tea out of three types the respondent prefers (green, black, Earl Grey)
  • price: the price level of the tea that the respondent prefers
  • How: whether the respondent takes tea as is, with lemon, with milk, or in some other way
  • sex: the sex of the respondent
  • SPC: some kind of general social group for the respondent (student, employee (white collar?), middle (management?), senior)
  • age_Q: the age group of the respondent
  • frequency: how often the respondent drinks tea

Of these, Tea, price, and How are “active” variables, and the rest are “supplementary” variables.

As a sidenote, this dataset is very badly documented.

library(dplyr); library(FactoMineR)  # %>%/select() and MCA(); not loaded in the setup above
tea_filtered <- tea %>% dplyr::select(all_of(c("Tea", "price", "How", "sex", "SPC", "age_Q", "frequency")))
summary(tea_filtered)
##         Tea                  price        How      sex               SPC    
##  black    : 74   p_branded      : 95   alone:195   F:178   employee    :59  
##  Earl Grey:193   p_cheap        :  7   lemon: 33   M:122   middle      :40  
##  green    : 33   p_private label: 21   milk : 63           non-worker  :64  
##                  p_unknown      : 12   other:  9           other worker:20  
##                  p_upscale      : 53                       senior      :35  
##                  p_variable     :112                       student     :70  
##                                                            workman     :12  
##    age_Q          frequency  
##  +60  :38   +2/day     :127  
##  15-24:92   1 to 2/week: 44  
##  25-34:69   1/day      : 95  
##  35-44:40   3 to 6/week: 34  
##  45-59:61                    
##                              
## 

Let’s look at a summary of the MCA of the dataset based on these specs above.

mca <- MCA(tea_filtered, quali.sup = 4:7, graph = F)
summary(mca)
## 
## Call:
## MCA(X = tea_filtered, quali.sup = 4:7, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.423   0.399   0.383   0.355   0.335   0.330   0.316
## % of var.             12.688  11.967  11.497  10.653  10.043   9.907   9.484
## Cumulative % of var.  12.688  24.655  36.152  46.805  56.848  66.755  76.239
##                        Dim.8   Dim.9  Dim.10
## Variance               0.283   0.271   0.238
## % of var.              8.490   8.142   7.129
## Cumulative % of var.  84.729  92.871 100.000
## 
## Individuals (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1               |  0.720  0.409  0.056 |  1.341  1.503  0.196 |  0.789  0.541
## 2               |  0.782  0.481  0.216 |  0.593  0.294  0.124 |  0.071  0.004
## 3               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 4               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 5               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 6               | -0.357  0.100  0.027 | -0.078  0.005  0.001 |  0.344  0.103
## 7               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 8               |  0.782  0.481  0.216 |  0.593  0.294  0.124 |  0.071  0.004
## 9               |  0.497  0.195  0.083 |  0.222  0.041  0.016 |  0.544  0.257
## 10              |  1.104  0.961  0.443 | -0.678  0.384  0.167 |  0.138  0.016
##                   cos2  
## 1                0.068 |
## 2                0.002 |
## 3                0.542 |
## 4                0.542 |
## 5                0.542 |
## 6                0.025 |
## 7                0.542 |
## 8                0.002 |
## 9                0.099 |
## 10               0.007 |
## 
## Categories (the 10 first)
##                     Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black           |   1.366  36.278   0.611  13.516 |  -0.083   0.143   0.002
## Earl Grey       |  -0.440   9.794   0.348 -10.207 |   0.312   5.225   0.175
## green           |  -0.493   2.106   0.030  -2.996 |  -1.636  24.612   0.331
## p_branded       |  -0.497   6.176   0.115  -5.856 |  -0.218   1.257   0.022
## p_cheap         |   1.116   2.289   0.030   2.982 |   1.313   3.361   0.041
## p_private label |   0.018   0.002   0.000   0.084 |  -0.132   0.102   0.001
## p_unknown       |   0.314   0.310   0.004   1.108 |   2.953  29.139   0.363
## p_upscale       |   1.063  15.724   0.242   8.512 |  -0.874  11.272   0.164
## p_variable      |  -0.188   1.036   0.021  -2.504 |   0.225   1.575   0.030
## alone           |  -0.275   3.870   0.140  -6.477 |  -0.328   5.831   0.199
##                  v.test     Dim.3     ctr    cos2  v.test  
## black            -0.825 |   0.099   0.209   0.003   0.977 |
## Earl Grey         7.240 |  -0.247   3.420   0.110  -5.741 |
## green            -9.947 |   1.224  14.341   0.185   7.443 |
## p_branded        -2.566 |   0.459   5.803   0.098   5.403 |
## p_cheap           3.509 |   0.315   0.201   0.002   0.841 |
## p_private label  -0.626 |   1.039   6.569   0.081   4.928 |
## p_unknown        10.421 |   1.519   8.030   0.096   5.362 |
## p_upscale        -6.999 |   0.310   1.476   0.021   2.483 |
## p_variable        2.999 |  -0.913  27.080   0.497 -12.188 |
## alone            -7.721 |  -0.153   1.330   0.044  -3.614 |
## 
## Categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## Tea             | 0.611 0.359 0.207 |
## price           | 0.324 0.559 0.565 |
## How             | 0.334 0.279 0.378 |
## 
## Supplementary categories (the 10 first)
##                    Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3   cos2
## F               | -0.099  0.014 -2.071 | -0.001  0.000 -0.018 | -0.092  0.012
## M               |  0.145  0.014  2.071 |  0.001  0.000  0.018 |  0.135  0.012
## employee        | -0.153  0.006 -1.307 | -0.104  0.003 -0.891 |  0.027  0.000
## middle          |  0.179  0.005  1.213 |  0.095  0.001  0.645 |  0.033  0.000
## non-worker      |  0.281  0.021  2.531 | -0.214  0.012 -1.927 | -0.010  0.000
## other worker    |  0.221  0.003  1.021 |  0.174  0.002  0.804 |  0.108  0.001
## senior          |  0.226  0.007  1.419 |  0.086  0.001  0.543 |  0.009  0.000
## student         | -0.350  0.037 -3.343 |  0.108  0.004  1.027 | -0.090  0.002
## workman         | -0.327  0.004 -1.154 |  0.167  0.001  0.588 |  0.133  0.001
## +60             |  0.702  0.072  4.624 | -0.316  0.014 -2.078 | -0.033  0.000
##                 v.test  
## F               -1.930 |
## M                1.930 |
## employee         0.229 |
## middle           0.225 |
## non-worker      -0.091 |
## other worker     0.498 |
## senior           0.056 |
## student         -0.861 |
## workman          0.470 |
## +60             -0.218 |
## 
## Supplementary categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## sex             | 0.014 0.000 0.012 |
## SPC             | 0.068 0.020 0.004 |
## age_Q           | 0.130 0.022 0.014 |
## frequency       | 0.011 0.007 0.026 |

5.2 Plotting and interpretation

Let’s start with a plot of the individuals (ind):

# graph.type = "ggplot" returns a ggplot object, so par(mfrow = ...) would have no effect here
plot(mca, graph.type = "ggplot", invisible = c("var", "quali.sup", "quanti.sup"), cex = 0.8)

I see no real clusters, but there seem to be some duplicates (several respondents share identical answer profiles and plot on top of each other). Dimension 1 explains 12.7% of the variance and dimension 2 explains 12.0%, roughly the same. The scatter plot forms a triangle in this space, with a wide base around Dim 1 = 0 that extends to the right.
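The variance percentages quoted above can also be pulled out programmatically from the `eig` slot of the MCA result, which makes a quick scree plot easy. A minimal sketch, assuming the `mca` object from the chunk above:

```r
# Explained variance per dimension: eigenvalue, % of variance, cumulative %
mca$eig[1:3, ]

# Simple base-R scree plot of the percentage-of-variance column
barplot(mca$eig[, 2], las = 2, ylab = "% of variance",
        main = "MCA scree plot")
```

Here the percentages decay very slowly (12.7%, 12.0%, 11.5%, …), which is typical for MCA and worth keeping in mind when interpreting only the first two dimensions.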

Now let’s plot the var variables — Tea, price, and How — which the plot function for FactoMineR’s MCA colors black, red, and green, respectively.

plot(mca, graph.type = "ggplot", invisible = c("ind", "quali.sup", "quanti.sup"), cex = 0.8, habillage = "quali")

Dimension 1 seems to cover:

  • most strongly (Tea eta2: 0.611), whether the respondent prefers black tea or not (black +1.37, green -0.49, Earl Grey -0.44)
  • then (How eta2: 0.334), whether the respondent likes to add things to their tea (alone -0.28; lemon, milk, and other all positive)
  • finally, no clear pattern with price, since both p_upscale and p_cheap are positive on dimension 1

Dimension 2 seems to cover:

  • whether the respondent is unlikely to prefer green tea (green -1.64)
  • whether the respondent likes cheap tea (p_cheap +1.31, p_upscale -0.87)

The p_unknown price category has the highest coordinate on this dimension (+2.95), which makes it tough to interpret.
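One way to see whether p_unknown is genuinely important or just an extreme small category is to inspect the contribution table directly rather than reading it off the plot. A sketch, assuming the `mca` object from above:

```r
# Category contributions (in %) to each dimension; sorting by
# dimension 2 shows which categories actually drive that axis
contrib <- mca$var$contrib
head(round(contrib[order(contrib[, 2], decreasing = TRUE), 1:3], 1))
```

Since p_unknown has only 12 respondents, its large coordinate partly reflects the fact that rare categories in MCA are pulled far from the origin.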

Finally, let’s look at the demographics of the individuals and see how they distribute over this space:

plot(mca, graph.type = "ggplot", invisible = c("ind", "var", "quanti.sup"), cex = 0.8, habillage = "quali")

It seems that age best explains the differences in respondents’ tea-drinking habits (dim 1 eta2: 0.130, dim 2: 0.022), followed by SPC, which encodes some of the same information as age and so may be partly redundant. Using the dimension interpretations from before, we could roughly say that:

  • older people are more likely to prefer generic black tea to green tea or Earl Grey
  • younger people are more likely to prefer Earl Grey
  • green tea is most likely to be preferred by people in the 25-34 age range who are “employees” (office workers?)
  • older people prefer to add milk, lemon, or something else to their tea
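These age-related readings of the map can be sanity-checked against the raw counts in a cross table. A sketch, assuming `tea_filtered` from the selection step above:

```r
# Cross-tabulate preferred tea type against age group to check the
# age patterns suggested by the MCA map
table(tea_filtered$age_Q, tea_filtered$Tea)

# Row proportions make the comparison across age groups easier
round(prop.table(table(tea_filtered$age_Q, tea_filtered$Tea), margin = 1), 2)
```

The map distances only suggest associations; the proportions tell us how strong they actually are in each age group.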

There is also a slight difference in the distribution of answers between sexes:

  • men are more likely to prefer black tea
  • women are more likely to prefer Earl Grey or green tea

The frequency of tea drinking is scattered all over the map and forms no clear trend in the biplot.
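To double-check that frequency really carries no clear structure, FactoMineR can draw confidence ellipses around the category mean points; heavily overlapping ellipses would support the “no clear trend” reading. A sketch, assuming the `mca` object from above:

```r
# Confidence ellipses around the mean point of each frequency category;
# overlap between the ellipses suggests the categories are not separated
# in the first two MCA dimensions
plotellipses(mca, keepvar = "frequency")
```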